Introduction

America is a highly diverse country. It is not only diverse in terms of ethnicity, but also in terms of income, industry, and law. This opens the door to a variety of possible interactions between these variables. What factors drive the way that income is distributed in the United States? What factors reliably predict whether the average income per capita in a specific area is high or low? How do state-level variations in law and freedom impact income?

How are these questions SMART?

These questions are important because they tell many facets of the story of consumption in the United States. Income serves as a measure of both productivity and lifetime consumption (although this analysis does not disentangle the two). Although their scope is broad, the questions remain specific to the concepts of income, demographics, and freedom, and maintain a consistent structure: how do demographics and freedom drive income in the United States at the census tract level?

These questions also correspond to a set of highly measurable (and, luckily, premeasured) variables. Income can be imputed from tax records, while ethnicity and work status are available from census forms. Answering the questions is made simpler by the cleanliness and availability of this data: since few data points are missing across the census tracts of the 50 states of interest, it is straightforward to form statistical tests.

Finally, these questions are relevant to policy makers who want to improve the incomes of their constituents, as well as to researchers interested in establishing a baseline for the average income they should expect a community to earn based on its demographics. These are critical questions, because the ability of communities to support themselves economically has massive impacts on the wellbeing of their members.

Content

First, we examine how the U.S. Census Bureau dataset is structured and which variables were included. Second, we present the groups of independent variables and how each of them could affect the income per capita of a community. Then, an exploratory data analysis and some statistical tests evaluate the significance of our variables. Finally, a conclusion looks into further challenges and questions that future analyses would need to address.

Dataset

U.S. Census Bureau Dataset

The U.S. Census Bureau conducts the yearly American Community Survey, a project which asks Americans around the country about several dimensions of their lives, including work, income, demographics, and other activities (U.S. Census Bureau, 2019). The dataset from 2015 was available via Kaggle (MuonNeutrino, 2015) and included more than 74,000 observations with 37 columns (variables). The dataset includes two variables related to income: the median household income and income per capita. Income per capita was preferred because it adjusts per person rather than per household, and the number of people living in a given household is unknown. The variable (IncomePerCap) is calculated as the average income per person in a specific census tract. But what is a census tract, and why use one?

Census tracts

Household income in America varies significantly by geographical location. The richest counties in the country are concentrated in urban areas near big metropolises where most businesses are located. The Bay Area in northern California, Northeast Virginia and New York are some examples. However, counties are an inadequate unit for comparing variables across the country. There are 3,142 counties in a country of 300 million inhabitants (U.S. Census Bureau, 2019), and they are far from uniform. Texas, for example, has 254 counties (U.S. Census Bureau, 2017). California, a state with approximately 10 million more people than Texas, has only 58 counties (U.S. Census Bureau, 2017). Population-wise, California has the most populous county in the country, with more than 10 million inhabitants (Los Angeles), whereas Texas has more than 80 counties with fewer than 10,000 people (U.S. Census Bureau, 2017). Density-wise, New York has 4 of the 5 densest counties in the country, some of them 60,000 times denser than counties in Hawaii, Alaska or Nevada (U.S. Census Bureau, 2013).

As a response to these inconsistencies among counties, the U.S. Census Bureau delineated “Census Tracts” at the beginning of the twentieth century. A census tract is a “geographic region defined for the purpose of taking a census.” Over the years, the U.S. Census Bureau has established census tracts in every county in America. There are over 74,000 census tracts in the country, and a typical one has around 4,000 residents. This consistency is a strength: census tracts are by and large similar in population size, and that size does not vary much from state to state.

Description of Variables

The complete dataset includes 17 independent variables and 1 dependent variable. Based on their nature, the independent variables were classified into three groups, two of which, Work Variation and Ethnic Variation, are described below.

Work Variation

Professional: Percentage (%) employed in management, business, science, and arts in a census tract.

Service: Percentage (%) employed in service jobs in a census tract.

Office: Percentage (%) employed in sales and office jobs in a census tract.

Construction: Percentage (%) employed in natural resources, construction, and maintenance in a census tract.

Production: Percentage (%) employed in production, transportation, and material movement in a census tract.

Unemployed: Unemployment rate (%) in a census tract.

Self-employed: Percentage (%) self-employed in a census tract.

Ethnic Variation

Native: Percentage (%) of population that is Native American or Native Alaskan in a census tract.

White: Percentage (%) of population that is white in a census tract.

Black: Percentage (%) of population that is black in a census tract.

Hispanic: Percentage (%) of population that is Hispanic/Latino in a census tract.

Asian: Percentage (%) of population that is Asian in a census tract.

EDA

Population Histogram and QQ

[1] 0

Outliers identified: 3589 
Propotion (%) of outliers: 5.2 
Mean of the outliers: 74140.26 
Mean without removing outliers: 28491.23 
Mean if we remove outliers: 26139.73 
Outliers successfully removed 

[1] "4369.254"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      5    2929    4086    4369    5460   53812 
[1] 2095.011

A baseline analysis of population and income was conducted. The histogram for population was skewed to the right. Census tracts had similar population counts, with a mean of about 4,400 and a median of about 4,100. Counties, by contrast, vary widely: some have a population of 1 million and others 10 million. Because tract populations are similar, census tracts were easier to compare than counties. The Q-Q plot confirmed the non-normality, as the points in the upper quartile fell far from the reference line.
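
The outlier summary printed above resembles an interquartile-range rule, although the exact function used is not shown. The following is a minimal sketch of that kind of rule, assuming Tukey's 1.5 × IQR fences and using simulated values in place of the census data; the names `income`, `is_outlier` and `income_clean` are illustrative only.

```r
# Simulated stand-in for per-capita income by census tract (not the real data).
set.seed(1)
income <- c(rlnorm(1000, meanlog = 10.1, sdlog = 0.4), 150000, 200000)

# Tukey's rule (an assumption): flag values more than 1.5 * IQR beyond the quartiles.
q <- unname(quantile(income, probs = c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- income < q[1] - fence | income > q[2] + fence

cat("Outliers identified:", sum(is_outlier), "\n")
cat("Proportion (%) of outliers:", round(100 * mean(is_outlier), 1), "\n")
cat("Mean without removing outliers:", round(mean(income), 2), "\n")
cat("Mean if we remove outliers:", round(mean(income[!is_outlier]), 2), "\n")

# Keep only the values inside the fences.
income_clean <- income[!is_outlier]

# Visual normality check on the cleaned values (uncomment to plot).
# hist(income_clean)
# qqnorm(income_clean); qqline(income_clean)
```

Because the simulated values are positive and right-skewed, every flagged point lies above the upper fence, so removing them lowers the mean, as in the summary above.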

Income Histogram and QQ

[1] 3589
[1] 0
[1] 69672
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    128   18776   24730   26140   32247   56040 
[1] 10274.98

Individual EDA of Work Variations

 Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   5.300   7.900   9.251  11.600 100.000     101 

 Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   23.70   31.70   33.23   41.80  100.00     105 

 Factor w/ 4 levels "[0,20.3]","(20.3,23.9]",..: 2 2 2 3 1 4 3 4 3 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   20.30   23.90   24.12   27.70  100.00     105 

 Factor w/ 4 levels "[0,14.1]","(14.1,18.3]",..: 2 4 4 3 2 2 4 1 2 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   14.10   18.30   19.65   24.00  100.00     105 

 Factor w/ 4 levels "[0,5.4]","(5.4,8.7]",..: 3 3 3 2 1 2 3 2 2 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   5.400   8.700   9.636  12.800 100.000     105 

 Factor w/ 4 levels "[0,7.7]","(7.7,12.3]",..: 3 4 3 3 3 3 3 1 3 4 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    7.70   12.30   13.36   17.80  100.00     105 

 Factor w/ 4 levels "[0,3.5]","(3.5,5.4]",..: 2 3 4 1 2 3 1 3 2 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   3.500   5.400   6.109   7.900 100.000     105 

Next, the seven work variables (professional, production, unemployment, office, service, construction, self-employed) were examined and assessed for normality. Each variable was split into quartiles and income per capita was compared across them with boxplots. Income decreased as the share of unemployment, service, construction, or production work in a census tract increased. That is to say, as more unemployed individuals were accounted for in a given census tract, the income per capita decreased. The only work variable associated with an increase in average income was professional work. Income remained relatively stable across the quartiles of office and self-employed. In the histograms, only the proportion of professionals appeared normally distributed; the remaining six work variables were all skewed to the right. The Q-Q plot for professionals supported normality, as the points did not stray far from the reference line and showed only very small right and left tails. The same cannot be said for the other variables, each of which had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed while the other work variables did not.
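
The `Factor w/ 4 levels` lines above are what a quartile split of each variable would produce. The following is a minimal sketch of such a split, assuming `cut()` with quartile break points; the data are simulated and the names `unemployment`, `income` and `unemp_q` are hypothetical.

```r
# Simulated stand-ins for one work variable and income (not the real data).
set.seed(2)
n <- 500
unemployment <- round(runif(n, 0, 30), 1)                    # percent
income <- 35000 - 600 * unemployment + rnorm(n, sd = 4000)   # dollars

# Break points at the quartiles; include.lowest = TRUE keeps the minimum
# value inside the first bin, giving a first level of the form "[min,q1]".
breaks <- quantile(unemployment, probs = seq(0, 1, 0.25), na.rm = TRUE)
unemp_q <- cut(unemployment, breaks = breaks, include.lowest = TRUE)

str(unemp_q)
summary(unemployment)

# Median income within each quartile bin, the quantity a boxplot displays.
median_by_bin <- tapply(income, unemp_q, median)
print(round(median_by_bin))

# boxplot(income ~ unemp_q)   # uncomment to draw the boxplots
```

In this simulated example income falls as unemployment rises, so the median drops from the first bin to the fourth.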

Individual EDA of ethnicities

 Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.80    4.00   13.78   15.32  100.00 

 Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.40    7.20   17.36   21.50  100.00 

 Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.100   1.200   4.347   4.400  91.300 

 Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   37.10   70.30   61.24   88.40  100.00 

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.7567   0.4000 100.0000 

Finally, the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated. The boxplots for White showed an increase in average income across the first, second, and third quartiles but no change in the fourth. The boxplots for Asian showed an increase from the first through the fourth quartile. The boxplots for Hispanic increased slightly between the first and second quartiles, did not change for the third, and decreased significantly in the fourth. The boxplots for Black showed an increase in average income between the first and second quartiles, followed by a decrease from the second to the fourth. Overall, average income did appear to change with the concentration of ethnicities in a census tract. The histogram for White was bimodal with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White was the most represented group, followed by Hispanic, Black, Asian, and Native. For every ethnicity variable, the points on the Q-Q plot followed a curve with large left and right tails rather than the reference line. There were not enough responses from the Native group to construct a meaningful boxplot, and the Native Q-Q plot showed a clear pattern of departure from the line, implying non-normality. Therefore, based on the boxplots, histograms, and Q-Q plots, none of the ethnicity variables appears normally distributed.
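
The skew seen in the histograms can also be put into a number. The following is a minimal sketch, assuming the simple moment-based skewness coefficient (a technique not used in the original analysis) and simulated data; `skewness` here is a helper defined for illustration, not a base R function.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
# Positive values indicate a long right tail; values near 0 indicate symmetry.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated examples (not the real data).
set.seed(3)
right_skewed <- rexp(2000, rate = 1 / 5)            # mean above median, long right tail
symmetric    <- rnorm(2000, mean = 33, sd = 10)     # roughly bell-shaped

skewness(right_skewed)
skewness(symmetric)

# A mean well above the median is a quick sign of right skew.
mean(right_skewed) > median(right_skewed)
```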

'data.frame':   69672 obs. of  11 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ IncomePerCap: int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
[1] 626
[1] 0
[1] 11
'data.frame':   69567 obs. of  11 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ IncomePerCap: int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
 - attr(*, "na.action")= 'omit' Named int  1484 1807 2299 2499 2789 4259 4444 4448 4449 4477 ...
  ..- attr(*, "names")= chr  "1514" "1851" "2370" "2574" ...

PCA

A Principal Component Analysis (PCA) and Principal Component Regression (PCR) seemed suited to this dataset. The purpose of this technique is to decrease the number of variables while accounting for collinearity. The data frame above contains 10 variables to explain IncomePerCap. However, the correlation matrix shows notable correlation between some of the predictor variables. For example, Professional has notable correlations with Service, Construction, Production and Unemployment, and White has notable correlations with Hispanic and Black. From this initial overview of the correlation matrix, PCA seemed suitable and the analysis continued.
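
As a rough illustration of this step, the following sketch inspects a correlation matrix and then runs a PCA on standardized predictors. It assumes `prcomp()` with centering and scaling, which may differ from the settings used in the original analysis, and the data and variable names are simulated stand-ins.

```r
# Simulated predictors with built-in collinearity (not the real data).
set.seed(4)
n <- 300
professional <- rnorm(n, 33, 12)
service      <- 40 - 0.5 * professional + rnorm(n, sd = 5)
production   <- 30 - 0.4 * professional + rnorm(n, sd = 5)
office       <- rnorm(n, 24, 5)
predictors <- data.frame(professional, service, production, office)

# Correlation matrix: large off-diagonal entries signal collinearity.
round(cor(predictors), 2)

# PCA on centered and scaled predictors, so each variable contributes
# unit variance regardless of its original spread.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)
summary(pca)

# biplot(pca)   # uncomment to draw the PC1 vs PC2 biplot
```

With scaled inputs the component variances sum to the number of predictors, which is why the proportions of variance in a summary table add up to 1.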

More than 70,000 data points were analyzed. The biplot shows the spread of the data along the PC1 and PC2 axes. PC1 carries the most variation, with scores between approximately -6 and 8, as confirmed by the summary data below, while PC2 scores range from about -10 to 7. The variables Professional, Black, Production and White are fairly evenly split between PC1 and PC2. Other variables, such as Office, Service, Unemployed and Construction, are represented mainly in PC2 rather than PC1.

Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933
Proportion of Variance 0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919
Cumulative Proportion  0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764
                           PC8     PC9     PC10
Standard deviation     0.71304 0.12303 0.003342
Proportion of Variance 0.05084 0.00151 0.000000
Cumulative Proportion  0.99849 1.00000 1.000000

The breakdown of the variation explained by each component shows that just under 50% of the variation is accounted for by the first two components and about 63% by the first three. However, except for the first component, the amount of variation explained changes by a similar step from one component to the next. This is further illustrated by the following graph.
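
The proportions in the table can be reproduced from the printed standard deviations, since each component's variance is its standard deviation squared. The following sketch uses only the values from the table above; the variable names are illustrative.

```r
# Standard deviations of PC1..PC10, copied from the summary table.
sdev <- c(1.7878, 1.3389, 1.1653, 1.0355, 0.88819,
          0.82267, 0.76933, 0.71304, 0.12303, 0.003342)

# Proportion of variance = component variance / total variance.
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

result <- round(rbind(prop_var, cum_var), 4)
colnames(result) <- paste0("PC", seq_along(sdev))
print(result)
```

The total variance comes out very close to 10, which is consistent with a PCA run on 10 standardized variables.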


Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3154   -136   3093  39355 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 26167.82      20.93 1250.463  < 2e-16 ***
PC1         -4585.05      11.71 -391.701  < 2e-16 ***
PC2         -1454.29      15.63  -93.043  < 2e-16 ***
PC3           604.54      17.96   33.664  < 2e-16 ***
PC4           994.55      20.21   49.214  < 2e-16 ***
PC5          -878.20      23.56  -37.274  < 2e-16 ***
PC6          1377.18      25.44   54.140  < 2e-16 ***
PC7          -205.74      27.20   -7.564 3.96e-14 ***
PC8          -196.99      29.35   -6.712 1.93e-11 ***
PC9          3301.06     170.10   19.407  < 2e-16 ***
PC10        -3519.28    6262.21   -0.562    0.574    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.704e+04 on 10 and 69556 DF,  p-value: < 2.2e-16

A full principal component regression was run on these components, and all components except the last (PC10) are significant. This regression explains only about 71.0% of the variability in IncomePerCap (adjusted R-squared of 0.7101). The strongest variable is PC1, whose t-value is by far the largest in magnitude.

R-squared shows the share of variation in the dependent variable, IncomePerCap, that is explained by the components. The steep initial increase in the R-squared graph, followed by a flattening, seems to indicate that a significant amount of the variation in IncomePerCap is explained by the first component alone.

Based on the initial analysis of the R-squared graph and the results of the regression, it seemed appropriate to run a regression on PC1 alone, which resulted in a lower adjusted R-squared.


Call:
lm(formula = IncomePerCap ~ PC1, data = pcadata_pcr_rot)

Residuals:
   Min     1Q Median     3Q    Max 
-42965  -4026   -442   3546  36606 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 26167.82      23.34  1121.0   <2e-16 ***
PC1         -4585.05      13.06  -351.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6157 on 69565 degrees of freedom
Multiple R-squared:  0.6393,    Adjusted R-squared:  0.6393 
F-statistic: 1.233e+05 on 1 and 69565 DF,  p-value: < 2.2e-16

The tradeoff between parsimony and description of these two potential models makes the choice of model unclear.

Assuming we choose the more explanatory model, with the number of components selected by adjusted R-squared, we eliminate only one component from the regression, so the model is not effectively pared down. However, bias is low, since only one component was dropped.
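
The tradeoff can be made concrete by fitting the regression with an increasing number of components and recording the adjusted R-squared each time. The following is a minimal sketch on simulated data, assuming component scores from `prcomp()` and ordinary `lm()` fits; the names `X`, `y`, `scores` and `adj_r2` are hypothetical and do not correspond to objects in the original analysis.

```r
# Simulated predictors and response (not the real data).
set.seed(5)
n  <- 400
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # correlated with x1
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)
y  <- 26000 + 4000 * x1 + 1500 * x3 + rnorm(n, sd = 3000)

# Component scores from a PCA on standardized predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x)
scores$y <- y

# Regress y on the first k components for k = 1, 2, 3.
adj_r2 <- sapply(seq_len(ncol(pca$x)), function(k) {
  f <- reformulate(paste0("PC", seq_len(k)), response = "y")
  summary(lm(f, data = scores))$adj.r.squared
})
names(adj_r2) <- paste0("PC1..PC", seq_along(adj_r2))
round(adj_r2, 3)
```

Because principal components are uncorrelated, the coefficient on a component does not change when others are added, which matches the identical PC1 estimates in the two regressions above.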

K-Means
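
The listings in this section look like printed `kmeans` objects for an increasing number of clusters. The following is a minimal sketch of how such a comparison could be run, assuming `kmeans()` from base R on simulated data. Note that the cluster means in the listings show the predictors in standardized units but IncomePerCap in dollars; this sketch standardizes every column, which is an assumption and not necessarily what the original analysis did.

```r
# Three simulated groups of tracts (not the real data).
set.seed(6)
g <- rep(1:3, each = 100)
dat <- data.frame(
  professional = rnorm(300, c(20, 33, 50)[g], 4),
  unemployment = rnorm(300, c(14, 8, 5)[g], 2),
  income       = rnorm(300, c(16000, 28000, 43000)[g], 3000)
)

# Standardize all columns so no single variable dominates the distances.
scaled <- scale(dat)

# Share of total sum of squares explained by the clustering, for k = 2..7.
ratio <- sapply(2:7, function(k) {
  fit <- kmeans(scaled, centers = k, nstart = 10)
  fit$betweenss / fit$totss
})
names(ratio) <- paste0("k=", 2:7)
round(100 * ratio, 1)

# plot(2:7, ratio, type = "b")   # uncomment for an elbow-style plot
```

The ratio tends to rise as clusters are added, so the useful question is where the gains flatten out rather than which k gives the highest value.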

List of 9
 $ cluster     : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:2] 1.15e+12 1.34e+12
 $ tot.withinss: num 2.5e+12
 $ betweenss   : num 4.81e+12
 $ size        : int [1:2] 24086 45481
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481

Cluster means:
    Hispanic      White      Black      Asian Professional    Service
1 -0.3285506  0.4200874 -0.3199683  0.2372900    0.9282858 -0.6150829
2  0.1739950 -0.2224715  0.1694500 -0.1256649   -0.4916051  0.3257379
       Office Construction Production Unemployment IncomePerCap
1 -0.01890396   -0.4302842 -0.6556146   -0.5268802     37598.33
2  0.01001123    0.2278715  0.3472029    0.2790272     20114.40

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  2  2  2  2  1  2  1  2  2  2  2  2  2  2  1  2  2  1  1  2  2  1  2  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  2  1  1  1  1  1  2  1  1  2  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
 (between_SS / total_SS =  65.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:3] 4.33e+11 3.94e+11 3.93e+11
 $ tot.withinss: num 1.22e+12
 $ betweenss   : num 6.09e+12
 $ size        : int [1:3] 27175 29638 12754
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754

Cluster means:
    Hispanic      White      Black       Asian Professional    Service
1  0.4318377 -0.5751979  0.3952287 -0.15378102   -0.7689991  0.6127621
2 -0.2397814  0.3399978 -0.2076592 -0.02410908    0.1329131 -0.2111302
3 -0.3629095  0.4354829 -0.3595529  0.38368702    1.3296436 -0.8149861
       Office Construction  Production Unemployment IncomePerCap
1 -0.02605785    0.2974405  0.51204618    0.6194822     16576.79
2  0.07327428    0.0108065 -0.07894429   -0.3101802     27821.96
3 -0.11475466   -0.6588701 -0.90756657   -0.5991303     42759.54

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  1  1  2  2  2  1  2  2  2  2  1  1  1  2  2  2  1  3  3  2  2  2  1  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 1  1  2  2  3  2  3  1  3  3  2  2  2  2  1  1  2  1  1  1  1  1  1  1  1  1 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 1  1  2  1  1  1  1  1  1  1  1  2  1  1  1  1  1  2  1  1  1  1  1 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
 (between_SS / total_SS =  83.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "1" "2" "3" "4"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
 $ tot.withinss: num 7.15e+11
 $ betweenss   : num 6.6e+12
 $ size        : int [1:4] 17574 17698 8266 26029
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029

Cluster means:
    Hispanic      White       Black      Asian Professional     Service
1 -0.3016974  0.4066104 -0.28173021  0.1074823    0.5711741 -0.43512866
2  0.6680011 -0.8809881  0.57590840 -0.1651807   -0.9293827  0.83254425
3 -0.3738848  0.4389987 -0.37976189  0.4559152    1.5331982 -0.92442949
4 -0.1317654  0.1850702 -0.08076331 -0.1050413   -0.2406168  0.02128076
       Office Construction Production Unemployment IncomePerCap
1  0.06629532   -0.2290380 -0.4319137  -0.46098899     32840.09
2 -0.05484069    0.3137655  0.5749681   0.89883620     14433.73
3 -0.17952039   -0.7693828 -1.0184110  -0.62970642     45780.77
4  0.04953752    0.1856318  0.2240905  -0.09992813     23412.84

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 4  2  4  4  4  1  4  1  4  4  4  4  4  2  4  1  4  2  1  1  4  4  1  2  4  4 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  4  1  1  3  1  1  4  1  3  4  1  1  4  4  4  4  4  2  2  2  2  2  4  4  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  4  4  2  4  4  4  2  2  2  4  4  4  2  2  4  4  4  2  4  2  4  4 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
 (between_SS / total_SS =  90.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 5 1 1 1 5 5 1 4 1 5 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:5, 1:11] -0.00846 0.85002 -0.3769 -0.33227 -0.24884 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:5] 9.11e+10 1.06e+11 9.67e+10 9.02e+10 8.75e+10
 $ tot.withinss: num 4.72e+11
 $ betweenss   : num 6.84e+12
 $ size        : int [1:5] 20760 12813 6192 11579 18223
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 5 clusters of sizes 20760, 12813, 6192, 11579, 18223

Cluster means:
      Hispanic       White       Black       Asian Professional    Service
1 -0.008456987 -0.01428733  0.06737527 -0.12483068   -0.4552929  0.2049365
2  0.850017949 -1.08804866  0.68083922 -0.17734851   -1.0217088  0.9792310
3 -0.376900236  0.44391016 -0.39300833  0.48161313    1.6342530 -0.9824220
4 -0.332267736  0.42488521 -0.31698657  0.22115974    0.8637778 -0.5739916
5 -0.248840396  0.36049690 -0.22051300 -0.03726641    0.1329121 -0.2234518
       Office Construction  Production Unemployment IncomePerCap
1  0.02032925   0.25592838  0.38069044   0.09861782     20788.54
2 -0.07442558   0.32126962  0.59350367   1.09913874     13091.47
3 -0.21879800  -0.82200303 -1.06575585  -0.64529647     47520.79
4  0.02701923  -0.40030391 -0.64316728  -0.52575884     36236.88
5  0.08634809   0.01621363 -0.08018998  -0.33184070     27836.80

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 5  1  1  1  5  5  1  4  1  5  1  1  1  1  5  5  1  2  4  4  5  5  5  1  1  1 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 1  1  5  5  4  4  4  1  4  3  1  5  5  1  1  1  5  1  2  1  2  1  2  1  1  1 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 1  1  1  2  1  1  1  1  1  1  1  5  1  2  2  1  1  5  1  1  2  1  1 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1]  91107426890 106355408998  96685021931  90198216197  87531045107
 (between_SS / total_SS =  93.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
List of 9
 $ cluster     : Named int [1:69567] 6 4 4 6 6 2 4 2 6 6 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:6, 1:11] -0.349 -0.285 0.974 0.132 -0.385 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:6] 5.33e+10 5.39e+10 6.79e+10 5.20e+10 5.61e+10 ...
 $ tot.withinss: num 3.36e+11
 $ betweenss   : num 6.98e+12
 $ size        : int [1:6] 8294 13096 9901 16290 4763 17223
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 6 clusters of sizes 8294, 13096, 9901, 16290, 4763, 17223

Cluster means:
    Hispanic      White      Black       Asian Professional     Service
1 -0.3494717  0.4296992 -0.3360808  0.30654695    1.0897348 -0.68272359
2 -0.2854601  0.3976771 -0.2665621  0.05425283    0.4263884 -0.36762711
3  0.9739417 -1.2113147  0.7268560 -0.18905224   -1.0768116  1.07158297
4  0.1316581 -0.2288132  0.2195608 -0.13438783   -0.6075451  0.36540166
5 -0.3852568  0.4492850 -0.4005145  0.50197893    1.7117093 -1.02643109
6 -0.1925231  0.2792048 -0.1502203 -0.09190832   -0.1287054 -0.06945889
        Office Construction Production Unemployment IncomePerCap
1 -0.029446162   -0.5329557 -0.7839551   -0.5661097     38948.40
2  0.089116820   -0.1457015 -0.3276215   -0.4288697     31179.25
3 -0.088194644    0.3291525  0.5983303    1.2390235     12157.33
4  0.007625546    0.2812103  0.4727567    0.2799036     18933.00
5 -0.254726418   -0.8568239 -1.1023498   -0.6538796     48913.98
6  0.060350087    0.1491981  0.1403863   -0.1974674     24809.23

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 6  4  4  6  6  2  4  2  6  6  6  4  4  4  6  2  6  3  1  1  6  6  2  4  6  6 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 4  4  2  2  1  2  1  4  1  5  6  2  2  6  4  4  6  4  3  4  3  4  3  4  4  4 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 4  4  6  4  4  4  4  4  4  4  4  6  4  4  4  4  4  6  4  4  3  4  6 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 53289943534 53860735274 67861554081 51981016101 56119303347 52399097534
 (between_SS / total_SS =  95.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
List of 9
 $ cluster     : Named int [1:69567] 2 5 7 7 2 2 7 3 7 7 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:7, 1:11] -0.396 -0.258 -0.313 1.054 0.237 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:7] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:7] 3.01e+10 3.39e+10 3.39e+10 5.06e+10 3.68e+10 ...
 $ tot.withinss: num 2.52e+11
 $ betweenss   : num 7.06e+12
 $ size        : int [1:7] 3586 13111 9445 8258 13680 6091 15396
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 7 clusters of sizes 3586, 13111, 9445, 8258, 13680, 6091, 15396

Cluster means:
    Hispanic      White       Black      Asian Professional    Service
1 -0.3964506  0.4542494 -0.40764714  0.5282089    1.7925081 -1.0717282
2 -0.2576300  0.3733856 -0.23385201 -0.0269068    0.1755536 -0.2474022
3 -0.3125257  0.4200419 -0.30528640  0.1527982    0.6907586 -0.4893448
4  1.0539699 -1.2761352  0.73515083 -0.1940444   -1.1025060  1.1203953
5  0.2366030 -0.3958078  0.34172930 -0.1373108   -0.7010158  0.4859066
6 -0.3575666  0.4294729 -0.34914245  0.3685170    1.2778561 -0.7833195
       Office Construction Production Unemployment IncomePerCap
1 -0.29189615 -0.895033912 -1.1399021  -0.65954880     50244.25
2  0.08656937 -0.003686199 -0.1156669  -0.35053069     28266.51
3  0.06459196 -0.301790762 -0.5299635  -0.49757889     34274.90
4 -0.08602009  0.329565942  0.5903558   1.33577584     11570.50
5 -0.02053170  0.292153979  0.5252348   0.41200162     17787.86
6 -0.07512151 -0.640358239 -0.8939002  -0.59387071     41486.24
 [ reached getOption("max.print") -- omitted 1 row ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  5  7  7  2  2  7  3  7  7  7  7  5  5  7  2  7  5  3  3  2  2  2  5  7  7 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 5  5  2  2  6  3  3  7  3  6  7  2  3  7  5  5  2  5  4  5  5  5  4  5  7  5 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 5  5  7  5  5  7  7  5  5  5  5  2  5  5  5  7  5  2  5  7  4  5  7 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 30078367255 33918666154 33924690315 50620627295 36763050021 31979820579
[7] 34290620831
 (between_SS / total_SS =  96.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
List of 9
 $ cluster     : Named int [1:69567] 1 8 7 1 1 5 7 5 1 1 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:8, 1:11] -0.21 -0.398 -0.331 -0.359 -0.276 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:8] 2.38e+10 2.24e+10 2.47e+10 2.28e+10 2.44e+10 ...
 $ tot.withinss: num 1.95e+11
 $ betweenss   : num 7.12e+12
 $ size        : int [1:8] 13055 3152 7742 5140 10623 6131 13192 10532
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 8 clusters of sizes 13055, 3152, 7742, 5140, 10623, 6131, 13192, 10532

Cluster means:
    Hispanic       White       Black       Asian Professional     Service
1 -0.2099021  0.31065143 -0.17687237 -0.08368332  -0.08389187 -0.09881344
2 -0.3981141  0.45578358 -0.40960492  0.53311507   1.81785638 -1.08784127
3 -0.3314624  0.43062774 -0.31765082  0.20067007   0.83132143 -0.55740904
4 -0.3588452  0.42873072 -0.36110010  0.40713285   1.35698554 -0.82340123
5 -0.2764009  0.38868612 -0.25353852  0.02632578   0.34875186 -0.33062038
6  1.1604954 -1.34363273  0.72389507 -0.20529827  -1.11835632  1.17173776
       Office Construction Production Unemployment IncomePerCap
1  0.06138421    0.1289094  0.1064936  -0.23047133     25340.47
2 -0.30249732   -0.9092239 -1.1486967  -0.66205029     50788.01
3  0.03930789   -0.3810941 -0.6273035  -0.52118182     35892.73
4 -0.10289042   -0.6828886 -0.9379571  -0.61013558     42677.39
5  0.09089521   -0.1005031 -0.2648395  -0.40948506     30223.86
6 -0.08558404    0.3344531  0.5598170   1.46755772     10698.00
 [ reached getOption("max.print") -- omitted 2 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 1  8  7  1  1  5  7  5  1  1  1  7  7  8  1  5  7  8  3  3  1  1  5  7  7  7 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 8  7  5  5  4  5  3  7  3  4  1  5  5  1  7  7  1  7  6  8  8  8  8  7  7  7 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 8  7  7  8  7  7  7  8  7  8  7  1  7  8  8  7  7  1  7  7  6  7  7 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 23806194051 22351110068 24674184543 22763698227 24425126210 32229516603
[7] 22734152115 22364091140
 (between_SS / total_SS =  97.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
List of 9
 $ cluster     : Named int [1:69567] 7 3 3 7 9 1 3 1 7 7 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:9, 1:11] -0.294 -0.347 0.049 -0.402 1.22 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:9] 1.54e+10 1.52e+10 1.73e+10 1.58e+10 2.42e+10 ...
 $ tot.withinss: num 1.55e+11
 $ betweenss   : num 7.16e+12
 $ size        : int [1:9] 8025 5905 11753 2719 4994 4354 12230 9140 10447
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 9 clusters of sizes 8025, 5905, 11753, 2719, 4994, 4354, 12230, 9140, 10447

Cluster means:
     Hispanic      White      Black       Asian Professional     Service
1 -0.29398811  0.4084707 -0.2881384  0.09801206    0.5412472 -0.41887898
2 -0.34732990  0.4326622 -0.3266366  0.26353963    0.9956851 -0.63591035
3  0.04903299 -0.1030093  0.1312846 -0.13421926   -0.5427295  0.27794329
4 -0.40241766  0.4583190 -0.4144673  0.54771279    1.8491328 -1.10706894
5  1.21957721 -1.3726251  0.7042742 -0.21976591   -1.1201758  1.19133726
6 -0.35966388  0.4293714 -0.3692219  0.42709539    1.4300252 -0.86183629
        Office Construction  Production Unemployment IncomePerCap
1  0.092171984 -0.214882278 -0.42684953   -0.4601316     32552.93
2 -0.003486057 -0.478374937 -0.72844821   -0.5525047     37694.40
3  0.017290823  0.273752980  0.44804925    0.1820710     19710.62
4 -0.310870838 -0.926207144 -1.16439186   -0.6642217     51363.70
5 -0.083922955  0.335218762  0.54026333    1.5507180     10154.85
6 -0.136970585 -0.719826670 -0.97229327   -0.6191718     43867.61
 [ reached getOption("max.print") -- omitted 3 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 7  3  3  7  9  1  3  1  7  7  7  3  3  3  7  9  7  8  2  2  9  9  9  3  7  7 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 3  3  1  9  2  1  2  3  2  6  7  1  1  7  3  3  9  3  5  8  8  8  8  3  3  3 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 3  3  7  8  3  3  3  8  3  3  3  9  3  8  8  3  3  7  3  3  5  3  7 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 15416927509 15194151787 17298838728 15761413559 24236871148 16554689443
[7] 17160968891 17348442911 16348725566
 (between_SS / total_SS =  97.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
List of 9
 $ cluster     : Named int [1:69567] 9 2 8 9 5 5 8 10 9 9 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:10, 1:11] 1.293 0.172 -0.361 0.733 -0.269 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:10] 1.67e+10 1.25e+10 1.39e+10 1.17e+10 1.19e+10 ...
 $ tot.withinss: num 1.28e+11
 $ betweenss   : num 7.18e+12
 $ size        : int [1:10] 3740 9868 3958 7455 8799 2488 5279 10782 10229 6969
 $ iter        : int 2
 $ ifault      : int 4
 - attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 3740, 9868, 3958, 7455, 8799, 2488, 5279, 10782, 10229, 6969

Cluster means:
      Hispanic      White       Black        Asian Professional    Service
1   1.29347073 -1.3926502  0.65110535 -0.227144565  -1.10439711  1.2206491
2   0.17193979 -0.2995527  0.27181346 -0.131665132  -0.65475827  0.4157717
3  -0.36067127  0.4329611 -0.37594612  0.433931771   1.47068572 -0.8852975
4   0.73251794 -1.0447516  0.74170741 -0.163014767  -1.02683185  0.9384102
5  -0.26944247  0.3834500 -0.24218980 -0.002215569   0.27358175 -0.2989392
6  -0.40143948  0.4574908 -0.41624907  0.552865396   1.86667764 -1.1156001
         Office Construction  Production Unemployment IncomePerCap
1  -0.076264666   0.32954885  0.47907560   1.65962676      9446.23
2  -0.005854058   0.28803780  0.50879858   0.34460804     18266.13
3  -0.150092261  -0.74228280 -0.99238368  -0.63110451     44529.44
4  -0.088022816   0.32539421  0.65367907   0.92950639     14162.91
5   0.094618993  -0.05725246 -0.20068842  -0.38834330     29357.81
6  -0.323599903  -0.93488726 -1.16996086  -0.66198939     51688.15
 [ reached getOption("max.print") -- omitted 4 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 9  2  8  9  5  5  8 10  9  9  8  8  2  2  9  5  8  4 10  7  9  9  5  2  8  8 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  2  5  5  7 10 10  8 10  3  8  5 10  8  2  2  9  2  4  2  4  2  4  2  8  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  2  8  4  2  8  8  2  2  2  8  9  2  4  2  8  2  9  2  8  4  2  8 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
 [1] 16668900913 12479477177 13910959117 11660948482 11895068060 12674139695
 [7] 12952720882 11959674961 11678714620 12133241415
 (between_SS / total_SS =  98.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

K-means is an unsupervised learning algorithm whose goal is to find groups, or clusters, of observations in order to identify patterns. The demographic and work-type variables were standardized so that they could be compared on a similar scale; IncomePerCap appears in the cluster output in dollars. K-means was run for k = 2 through 10 clusters.

For k = 2, the cluster with the higher IncomePerCap (37598) also had the higher cluster means for Professional (0.928), White (0.420), and Asian (0.237). The cluster plot shows all of the roughly 70,000 data points in green and the two clusters in blue and red. The clusters appear to overlap, but this occurs because the plot projects all of the variables onto a two-dimensional graph. With only two clusters, the between-cluster sum of squares accounts for about 65.8% of the total sum of squares.

For k = 3, the cluster with the highest IncomePerCap was cluster three at 42760. This cluster also had the highest cluster means for Professional (1.330) and Asian (0.3837). The first cluster, with an IncomePerCap cluster mean of 16577, had the highest Unemployment cluster mean at 0.619. The cluster plot shows three clusters, but the overlap makes it somewhat difficult to tell them apart. With three clusters, 83.3% of the total sum of squares is captured, which is a drastic improvement over two clusters.

A final analysis was conducted for k = 4. The cluster with the highest IncomePerCap was cluster three at 45781. This cluster had the highest Professional cluster mean at 1.533 and the highest Asian cluster mean at 0.456. The cluster with the lowest IncomePerCap was cluster two at 14434, and it had the highest Unemployment cluster mean at 0.8988. The cluster plot is difficult to interpret because all of the data points are projected onto two dimensions and there are now four clusters. With four clusters, 90.2% of the total sum of squares is captured, which is a drastic improvement over only two clusters. As the number of clusters increased from 5 to 10, the percentage captured did not increase drastically; for example, when k = 10, 98.2% is captured. Four clusters would therefore be sufficient, as they capture most of the variation in the data.
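The "percentage captured" quoted for each k is the ratio between_SS / total_SS, which equals 1 - tot.withinss / totss. The following is a minimal sketch of that arithmetic, written in plain Python rather than the R used for the report. It uses the rounded totss and tot.withinss values printed above for k = 6 to 10 and the percentages quoted in the text for k = 2, 3, and 4; the function names are illustrative only.

```python
# Share of the total sum of squares captured by a clustering:
# between_SS / total_SS = 1 - tot.withinss / totss.
TOTSS = 7.31e12  # totss as printed in the output (rounded)

# tot.withinss for k = 6..10, copied from the printed output (rounded)
TOT_WITHINSS = {6: 3.36e11, 7: 2.52e11, 8: 1.95e11, 9: 1.55e11, 10: 1.28e11}


def captured_pct(tot_withinss, totss=TOTSS):
    """Percentage of the total sum of squares explained by the clusters."""
    return 100.0 * (1.0 - tot_withinss / totss)


def marginal_gains(pct_by_k):
    """Gain in captured percentage between consecutive available values of k."""
    ks = sorted(pct_by_k)
    return {(a, b): pct_by_k[b] - pct_by_k[a] for a, b in zip(ks, ks[1:])}


captured = {k: captured_pct(w) for k, w in TOT_WITHINSS.items()}
# Percentages for k = 2, 3, 4 as quoted in the text (k = 5 is not quoted)
captured.update({2: 65.8, 3: 83.3, 4: 90.2})

gains = marginal_gains(captured)
for (k_from, k_to), gain in gains.items():
    print(f"k = {k_from} -> {k_to}: +{gain:.1f} percentage points")
```

The gains shrink quickly after the first few clusters, which is the usual "elbow" argument for stopping at a small k.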

KNN

Preprocessing KNN

 Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
[1] "factor"
[1] "[855,2.47e+04]"     "(2.47e+04,5.6e+04]"
 Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
[1] "factor"
[1] "[855,1.88e+04]"      "(1.88e+04,2.47e+04]" "(2.47e+04,3.23e+04]"
[4] "(3.23e+04,5.6e+04]" 
'data.frame':   69567 obs. of  12 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ ipc2        : Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
 $ ipc4        : Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
[1] 0
[1] 0
[1] 12

KNN Model

Train-Test split 3:1

KNN 2 categories

Selecting the correct “k”

How does “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for a range of values of “k.”

 num [1:2, 1:15] 1 0.796 3 0.823 5 ...
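As an illustration of the idea, here is a minimal sketch in plain Python (not the R code used for the report) of a function that tabulates test accuracy for several values of k. The toy data, labels, and function names are hypothetical, and the choice of odd k values is an assumption based on the output above, which starts at k = 1, 3, 5.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training rows closest to `point`."""
    by_distance = sorted(
        zip((math.dist(row, point) for row in train_x), train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy_by_k(train_x, train_y, test_x, test_y, k_values):
    """Return {k: share of test rows classified correctly}."""
    results = {}
    for k in k_values:
        hits = sum(
            knn_predict(train_x, train_y, p, k) == y
            for p, y in zip(test_x, test_y)
        )
        results[k] = hits / len(test_y)
    return results


# Hypothetical toy data: two well-separated groups of points
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["Low", "Low", "Low", "High", "High", "High"]
test_x = [(0.5, 0.5), (5.5, 5.5), (2.0, 2.0)]
test_y = ["Low", "High", "Low"]

accuracies = accuracy_by_k(train_x, train_y, test_x, test_y, [1, 3, 5])
for k, acc in accuracies.items():
    print(k, acc)
```

The k with the highest accuracy on held-out data would then be used for the final model.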

Results

 Factor w/ 2 levels "[855,2.47e+04]",..: 1 1 1 1 1 2 2 1 1 1 ...
 - attr(*, "nn.index")= int [1:22836, 1:9] 31430 8744 21004 2152 14716 18952 43436 37471 18814 14542 ...
 - attr(*, "nn.dist")= num [1:22836, 1:9] 0.569 0.47 0.541 0.497 0.401 ...
[1] 22836
dat_pred_ipc2
[855,2.47e+04]           High 
         11177          11659 
                dat_ipc2.testLabels
dat_pred_ipc2    [855,2.47e+04] High
  [855,2.47e+04]           9492 1685
  High                     1940 9719
[1] 22836
[1] 9492 9719
[1] 0.8412594
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  8.412594e-01   6.825270e-01   8.364545e-01   8.459773e-01   5.006131e-01 
AccuracyPValue  McnemarPValue 
  0.000000e+00   2.457036e-05 
         Sensitivity          Specificity       Pos Pred Value 
           0.8303009            0.8522448            0.8492440 
      Neg Pred Value            Precision               Recall 
           0.8336049            0.8492440            0.8303009 
                  F1           Prevalence       Detection Rate 
           0.8396656            0.5006131            0.4156595 
Detection Prevalence    Balanced Accuracy 
           0.4894465            0.8412729 
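The summary statistics above follow directly from the four cells of the confusion matrix. The sketch below, in plain Python, recomputes them from the printed counts. It assumes that rows are predictions, columns are true labels, and the first level, "[855,2.47e+04]", is treated as the positive class; the variable names are illustrative.

```python
# Cells of the 2 x 2 confusion matrix printed above
# (rows = predicted, columns = actual; positive class = "[855,2.47e+04]").
tp = 9492  # predicted low income, actually low income
fp = 1685  # predicted low income, actually High
fn = 1940  # predicted High, actually low income
tn = 9719  # predicted High, actually High

n = tp + fp + fn + tn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)  # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)    # also called positive predictive value

# Cohen's kappa: agreement beyond what the marginal totals alone would give
p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (accuracy - p_expected) / (1 - p_expected)

print(f"n = {n}")
print(f"accuracy = {accuracy:.7f}, kappa = {kappa:.7f}")
print(f"sensitivity = {sensitivity:.7f}, specificity = {specificity:.7f}")
print(f"precision = {precision:.7f}")
```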

KNN 4 categories

Selecting the correct “k”

How does “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for a range of values of “k.”

 num [1:2, 1:15] 1 0.563 3 0.597 5 ...

Results

 Factor w/ 4 levels "[855,1.88e+04]",..: 1 2 2 1 2 3 3 1 1 2 ...
[1] 22836
dat_pred_ipc4
     [855,1.88e+04]             Mid-Low (2.47e+04,3.23e+04]  (3.23e+04,5.6e+04] 
               5284                5917                5721                5914 
                     dat_ipc4.testLabels
dat_pred_ipc4         [855,1.88e+04] Mid-Low (2.47e+04,3.23e+04]
  [855,1.88e+04]                4128     981                 159
  Mid-Low                       1293    3015                1440
  (2.47e+04,3.23e+04]            237    1483                2919
  (3.23e+04,5.6e+04]              65     230                1215
                     dat_ipc4.testLabels
dat_pred_ipc4         (3.23e+04,5.6e+04]
  [855,1.88e+04]                      16
  Mid-Low                            169
  (2.47e+04,3.23e+04]               1082
  (3.23e+04,5.6e+04]                4404
[1] 22836
[1] 4128 3015 2919 4404
[1] 0.6334735
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  6.334735e-01   5.113148e-01   6.271853e-01   6.397277e-01   2.510510e-01 
AccuracyPValue  McnemarPValue 
  0.000000e+00   1.805656e-20 
                           Sensitivity Specificity Pos Pred Value
Class: [855,1.88e+04]        0.7213000   0.9324490      0.7812263
Class: Mid-Low               0.5281135   0.8305599      0.5095488
Class: (2.47e+04,3.23e+04]   0.5091575   0.8361691      0.5102255
Class: (3.23e+04,5.6e+04]    0.7765826   0.9120303      0.7446737
                           Neg Pred Value Precision    Recall        F1
Class: [855,1.88e+04]           0.9091272 0.7812263 0.7213000 0.7500681
Class: Mid-Low                  0.8407707 0.5095488 0.5281135 0.5186651
Class: (2.47e+04,3.23e+04]      0.8355828 0.5102255 0.5091575 0.5096909
Class: (3.23e+04,5.6e+04]       0.9251271 0.7446737 0.7765826 0.7602935
                           Prevalence Detection Rate Detection Prevalence
Class: [855,1.88e+04]       0.2506131      0.1807672            0.2313890
Class: Mid-Low              0.2500000      0.1320284            0.2591084
Class: (2.47e+04,3.23e+04]  0.2510510      0.1278245            0.2505255
Class: (3.23e+04,5.6e+04]   0.2483360      0.1928534            0.2589771
                           Balanced Accuracy
Class: [855,1.88e+04]              0.8268745
Class: Mid-Low                     0.6793367
Class: (2.47e+04,3.23e+04]         0.6726633
Class: (3.23e+04,5.6e+04]          0.8443065

Regressions

[1]  15 100

The dataset has ten independent variables predicting the dependent variable, IncomePerCap. Ridge regression was introduced because it minimizes the residual sum of squares plus a shrinkage penalty equal to lambda times the sum of squares of the coefficients. As lambda increases, the coefficients approach zero; the plot shows the entire path of the coefficients as they shrink towards zero. To build the ridge regression, a log-scale grid of lambda values was constructed from 10^10 to 10^-2 in 100 steps.
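To make the shrinkage concrete, the following is a minimal sketch in plain Python (not the R code used for the report). It builds a log-scale grid of 100 lambda values from 10^10 down to 10^-2 and evaluates the textbook ridge slope for a single centered predictor without an intercept, sum(x*y) / (sum(x^2) + lambda). The toy data and function names are hypothetical, and the penalty scaling used by a particular R package may differ from this textbook form.

```python
import math


def lambda_grid(high_exp=10.0, low_exp=-2.0, n=100):
    """Log-scale grid from 10**high_exp down to 10**low_exp with n points."""
    step = (high_exp - low_exp) / (n - 1)
    return [10.0 ** (high_exp - i * step) for i in range(n)]


def ridge_slope(x, y, lam):
    """Ridge estimate for one centered predictor and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


# Hypothetical centered toy data; the unpenalized (OLS) slope is 2.0
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

grid = lambda_grid()
path = [ridge_slope(x, y, lam) for lam in grid]

print(f"largest lambda  {grid[0]:.0e}: slope {path[0]:.2e}")
print(f"smallest lambda {grid[-1]:.0e}: slope {path[-1]:.4f}")
```

With a very large lambda the slope is driven almost to zero, and with a very small lambda it is close to the unpenalized estimate, which is the behaviour the coefficient path plot displays.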

Train and Test sets

To avoid bias when evaluating the Ridge and Lasso regressions, the data were divided into a training set and a test set by a random split, with 50% of the observations assigned to the training set.
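A random 50% split of this kind could be sketched as follows, in plain Python rather than the R used for the report. The seed value and the function name are assumptions made for illustration; the row count of 69567 is taken from the data summary above.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.5, seed=1):
    """Randomly partition row indices into a training and a test set."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = train_test_split_indices(69567)
print(len(train_idx), len(test_idx))
```

The model is then fitted on the training rows only, and the MSE is computed on the test rows.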

[1] 824.8974
lowest lamda from CV:  824.8974 
MSE for best Ridge lamda:  12154666 

All the coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
  19143.7093    -300.1618     436.8876     -34.8267     267.4880    1313.3468 
     Service       Office Construction   Production Unemployment 
   -988.9918    -358.9968    -465.5783    -707.8713    -806.0590 

R^2: 
[1] 0.8843423

To find the best Ridge regression model, cross validation was implemented. The cross validation plot indicates that the model with all ten predictors yields the lambda with the lowest mean squared error, and that the mean squared error decreases as lambda decreases. Ridge regression keeps all of the predictors in the model, and the best value for lambda is indicated by the first vertical line. The lowest lambda from the cross validation was found to be 825. The MSE for the best Ridge lambda was 30834392. The most positive coefficients were Professional at 3248, White at 1104, and Asian at 546. The strongest negative coefficients were Service at -2195, Production at -2021, and Unemployment at -1602. It is interesting to note that Professional was the only work variable with a positive coefficient; the other work variables were all negative. The R^2 value for the best Ridge model was found to be 0.707, which means that 70.7% of the variation in income can be explained by the model. (The printed output above reports a lower MSE, 12154666, and a higher R^2, 0.884; that run appears to have also included the binned income factors ipc2 and ipc4, which are derived from IncomePerCap.)

Lasso

lowest lamda from CV:  16.19307 
 MSE for best Lasso lamda:  30709528 

All the coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
 26167.81892     13.56463   1690.00411    247.03605    712.23492   6030.24803 
     Service       Office Construction   Production Unemployment 
  -716.51986    367.51269      0.00000   -622.32649  -1612.95442 

The non-zero coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
 26167.81892     13.56463   1690.00411    247.03605    712.23492   6030.24803 
     Service       Office   Production Unemployment 
  -716.51986    367.51269   -622.32649  -1612.95442 
[1] 0.7077836

Lasso regression was also implemented to see whether it would perform differently from the OLS or Ridge models. Lasso regression can be useful in reducing overfitting and can assist in model selection. From the line plot it can be seen that the three most positive coefficients are Professional at 6030.2, White at 1690, and Asian at 712.2. This means that Professional, White, and Asian have a much stronger positive association with income than the other variables. The three most negative coefficients are Unemployment at -1613, Service at -716.5, and Production at -622.3. Construction was found to have a coefficient of 0.0, so it was removed from the final Lasso model. It is interesting to note that the coefficient for Hispanic is small at 13.6, although it was not shrunk all the way to zero.

Cross validation was used to select the lambda value with the lowest MSE. The selected model retained nine of the ten predictors, with Construction removed. The best lambda from cross validation was found to be 16.2, and the MSE for the best Lasso model was 30709528. The R^2 value was found to be 0.708, which means that 70.8% of the variation in income can be explained by the model.
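Unlike Ridge, the Lasso penalty can set a coefficient exactly to zero, which is what happened to Construction. A minimal sketch of the mechanism, in plain Python, is the soft-thresholding rule: in the simplified case of orthonormal predictors, each Lasso coefficient is the unpenalized coefficient moved towards zero by lambda and clipped at zero. The coefficient values and the threshold below are hypothetical and are not taken from the report.

```python
def soft_threshold(z, lam):
    """Move z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients (not the report's estimates)
unpenalized = {"A": 60.0, "B": -16.0, "C": 0.4, "D": -0.2}
lam = 0.5  # hypothetical penalty

lasso_like = {name: soft_threshold(b, lam) for name, b in unpenalized.items()}
kept = [name for name, b in lasso_like.items() if b != 0.0]

print(lasso_like)
print("kept:", kept)
```

Small coefficients such as C and D are dropped entirely, while large ones are only reduced slightly, which is why the Lasso doubles as a variable selection tool.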

Call:
lm(formula = IncomePerCap ~ ., data = datJLClean)

Residuals:
     Min       1Q   Median       3Q      Max 
-27748.0  -1894.5    -20.2   1770.9  18988.4 

Coefficients: (1 not defined because of singularities)
                        Estimate Std. Error  t value Pr(>|t|)    
(Intercept)             17391.74      36.71  473.707  < 2e-16 ***
Hispanic                  104.54      54.22    1.928   0.0539 .  
White                     669.66      72.93    9.183  < 2e-16 ***
Black                     334.20      52.11    6.414 1.43e-10 ***
Asian                     326.28      25.82   12.635  < 2e-16 ***
Professional             1077.33    2706.77    0.398   0.6906    
Service                  -814.52    1609.32   -0.506   0.6128    
Office                   -358.87    1173.52   -0.306   0.7598    
Construction             -378.99    1193.57   -0.318   0.7508    
Production               -515.94    1505.74   -0.343   0.7319    
Unemployment             -616.00      17.07  -36.087  < 2e-16 ***
ipc2High                19892.01      62.25  319.559  < 2e-16 ***
ipc4Mid-Low              5181.26      42.82  120.998  < 2e-16 ***
ipc4(2.47e+04,3.23e+04] -9860.07      41.82 -235.757  < 2e-16 ***
ipc4(3.23e+04,5.6e+04]        NA         NA       NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3412 on 69553 degrees of freedom
Multiple R-squared:  0.8893,    Adjusted R-squared:  0.8892 
F-statistic: 4.296e+04 on 13 and 69553 DF,  p-value: < 2.2e-16

Call:
lm(formula = IncomePerCap ~ Hispanic + White + Black + Asian + 
    Professional + Service + Office + Production + Unemployment, 
    data = datJLClean)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3155   -139   3092  39315 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26167.82      20.93 1250.46   <2e-16 ***
Hispanic       972.98      87.56   11.11   <2e-16 ***
White         2983.19     117.35   25.42   <2e-16 ***
Black         1175.89      84.20   13.97   <2e-16 ***
Asian         1115.17      41.58   26.82   <2e-16 ***
Professional  5991.86      53.93  111.11   <2e-16 ***
Service       -737.90      39.77  -18.55   <2e-16 ***
Office         359.14      28.63   12.54   <2e-16 ***
Production    -667.56      40.62  -16.44   <2e-16 ***
Unemployment -1604.14      26.65  -60.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69557 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.894e+04 on 9 and 69557 DF,  p-value: < 2.2e-16

MSE for full model : 
[1] 11638291

MSE for full model (w/o construction) : 
[1] 30460435

An OLS model was constructed for both the full model and the full model without the Construction variable, to compare them with the Ridge and Lasso models. The R^2 value for both OLS models was found to be 0.71, meaning that each explains about 71% of the variation in income. The MSE for the full model was found to be 30459848, while the full model without the Construction variable had a slightly larger MSE at 30460435. (The printed summary of the full model above also includes the binned income factors ipc2 and ipc4, which are derived from IncomePerCap itself; this appears to be why that output shows a higher R^2 of 0.889 and a lower MSE of 11638291, so those values are not comparable with the other models.) Overall, the Lasso, Ridge, and both OLS models explain roughly the same amount of variability in the data, with R^2 values all around 0.70 to 0.71. Since the full OLS model has the lowest MSE and the highest R^2, it would be a more suitable option than the Ridge, Lasso, or OLS model without Construction.

Conclusion

Overall, this analysis found several ways in which our independent variables reliably predict income in communities across the United States. The freedom variables we drew from the Cato Institute performed worst, with high internal correlation and little predictive power. Ethnicity and work type proportions had stronger predictive power, with the latter having the most powerful effects. However, these variables are largely non-normal, with a rightward skew, and have high internal correlations, both between and within the two categories. Altogether, these variables allow us to predict income per capita at the census tract level with high reliability (R-squared = .67); this is quite impressive given the simplicity of the data, which does not directly include any information about the age or education of the population.

Moving forward, this analysis allows for several extensions. The first is to integrate new data, such as the age and education status of census tract residents. Additionally, it may be valuable to consider each of the individual freedom measures on its own, to avoid the influence of high internal correlation. Finally, it would be interesting to examine whether there are differences driven by geographic density, which can be estimated with just the currently accessible data.

Bibliography

Cato Institute. (2018). Freedom in the Fifty States. Retrieved March 23, 2020, from https://www.freedominthe50states.org/how-its-calculated

MuonNeutrino. (2015). US Census Demographic Data: Demographic and Economic Data for Tracts and Counties. Retrieved March 23, 2020, from https://www.kaggle.com/muonneutrino/us-census-demographic-data

U.S. Census Bureau. (2019). “Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019”. 2010-2019 Population Estimates. United States Census Bureau, Population Division. December 30, 2019. Retrieved January 27, 2020.

U.S. Census Bureau. (2017). “American FactFinder - Results”. U.S. Census Bureau. Retrieved December 13, 2017.

U.S. Census Bureau. (2013). “2010 Census Summary File 1: Geographic Identifiers”. American FactFinder. U.S. Census Bureau. Retrieved October 18, 2013.